Abstract

We develop a formalism to address statistical pattern recognition of graph-valued data. Of particular
interest is the case of all graphs having the same number of uniquely labeled vertices. When the vertex 
labels are latent, such graphs are called shuffled graphs. Our formalism provides insight to trivially 
answer a number of open statistical questions including: (i) under what conditions does shuffling the 
vertices degrade classification performance and (ii) do universally consistent graph classifiers exist? 
The answers to these questions lead to practical heuristic algorithms with state-of-the-art finite sample 
performance, in agreement with our theoretical asymptotics. 

1 Introduction 

Representing data as graphs is becoming increasingly popular, as technological progress facilitates measur- 
ing "connectedness" in a variety of domains, including social networks, trade-alliance networks, and brain 
networks. While the theory of pattern recognition is deep [1], previous theoretical efforts regarding pattern 
recognition almost invariably assumed data are collections of vectors. Here, we assume data are collections 
of graphs (where each graph is a set of vertices and a set of edges connecting the vertices). For some data 
sets, the vertices of the graphs are labeled, that is, one can identify the vertex of one graph with a vertex of 
the others (note that this is a special case of assuming vertices are labeled, where each vertex has a unique 
label). For others, the labels are unobserved and/or assumed to not exist. We investigate the theoretical and 
practical implications of the absence of vertex labels. 

These implications are especially important in the emerging field of "connectomics", the study of
connections of the brain [2, 3]. In connectomics, one represents the brain as a graph (a brain-graph), where
vertices correspond to (groups of) neurons and edges correspond to connections between them. In the lower
tiers of the evolutionary hierarchy (e.g., worms and flies), many neurons have been assigned labels [4].
However, for even the simplest vertebrates, vertex labels are mostly unavailable when vertices correspond 
to neurons. 

Classification of brain-graphs is therefore poised to become increasingly popular. Although previous
work has demonstrated some possible strategies of graph classification in both the labeled [5] and unlabeled
[6] scenarios, relatively little work has compared the theoretical limitations of the two. We therefore develop
a random graph model amenable to such theoretical investigations. The theoretical results lead to universally 
consistent graph classification algorithms, and practical approximations thereof. We demonstrate that the 
approximate algorithm has desirable finite sample properties via a real brain-graph classification problem of 
significant scientific interest: sex classification. 

2 Graph Classification Models 

2.1 A labeled graph classification model 

A labeled graph $G = (\mathcal{V}, \mathcal{E})$ consists of a vertex set $\mathcal{V}$, where
$|\mathcal{V}| = n < \infty$ is the number of vertices, and an edge set $\mathcal{E}$, where
$|\mathcal{E}| \le n^2$.

Definition 1. Let $G : \Omega \to \mathcal{G}_n$ be a labeled graph-valued random variable taking values
$G \in \mathcal{G}_n$, where $\mathcal{G}_n$ is the set of labeled graphs on $n$ vertices.

The cardinality of $\mathcal{G}_n$ is super-exponential in $n$. For example, when all labeled graphs are
assumed to be simple (that is, undirected binary edges without loops), then
$|\mathcal{G}_n| = 2^{\binom{n}{2}} = d_n$. Let $Y$ be a categorical random variable,
$Y : \Omega \to \mathcal{Y} = \{y_0, \ldots, y_c\}$, where $c < \infty$. Assume the existence of a joint
distribution, $\mathbb{P}_{G,Y}$, which can be decomposed into the product of a class-conditional distribution
(likelihood) $\mathbb{P}_{G|Y}$ and a class prior $\pi_Y$. Because $n$ is finite, the class-conditional
distributions $\mathbb{P}_{G|Y=y} = \mathbb{P}_{G|y}$ can be considered discrete distributions
$\mathrm{Discrete}(G; \theta_y)$, where $\theta_y$ is an element of the $d_n$-dimensional unit simplex
$\Delta_{d_n}$ (satisfying $\theta_{G|y} \ge 0$ for all $G \in \mathcal{G}_n$ and
$\sum_{G \in \mathcal{G}_n} \theta_{G|y} = 1$).
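To give a feel for how quickly the parameter space of the labeled model grows, the following minimal Python sketch evaluates $d_n = 2^{\binom{n}{2}}$ for a few values of $n$. The function name `num_labeled_simple_graphs` is ours, chosen for illustration only.

```python
from math import comb


def num_labeled_simple_graphs(n: int) -> int:
    """Number of labeled simple graphs on n vertices: d_n = 2^(n choose 2)."""
    return 2 ** comb(n, 2)


# Print n alongside the number of decimal digits of d_n.
for n in (3, 5, 10, 30):
    d_n = num_labeled_simple_graphs(n)
    print(n, len(str(d_n)))
```

Each class-conditional distribution $\theta_y$ has one entry per labeled graph, so these counts are also the number of likelihood parameters per class.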

2.2 A shuffled graph classification model 

In the above, it was implicitly assumed that the vertex labels were observed. However, in certain situations 
(such as the motivating connectomics example presented in Section 1), this assumption is unwarranted. To
proceed, we define two graphs $G, G' \in \mathcal{G}_n$ to be isomorphic if and only if there exists a vertex
permutation (shuffle) function $Q : \mathcal{G}_n \to \mathcal{G}_n$ such that $Q(G) = G'$. Let $Q$ be a
permutation-valued random variable, $Q : \Omega \to \mathcal{Q}_n$, where $\mathcal{Q}_n$ is the space of
vertex permutation functions on $n$ vertices so that $|\mathcal{Q}_n| = n!$.

Definition 2. Let $G' = Q(G) : \Omega \to \mathcal{G}_n$ be a shuffled graph-valued random variable, that is,
a labeled graph-valued random variable that has been passed through a random shuffle channel $Q$.

Extending the above graph-classification model to include this vertex shuffling distribution yields
$\mathbb{P}_{Q,G,Y}$. We assume throughout this work (with loss of generality) that the shuffling distribution
is both class independent and graph independent; therefore, this joint model can be decomposed as

$$\mathbb{P}_{Q,G,Y} = \mathbb{P}_Q \mathbb{P}_{G,Y} = \mathbb{P}_Q \mathbb{P}_{G|Y} \pi_Y = \mathbb{P}_{Q(G)|Y} \pi_Y. \qquad (1)$$

As in the labeled case, the shuffled graph class-conditional distributions $\mathbb{P}_{Q(G)|y}$ can be
represented by discrete distributions $\mathrm{Discrete}(G; \theta'_y)$. Because $Q(G)$ can be any of
$|\mathcal{G}_n|$ different graphs, it must be that $\theta'_y \in \Delta_{d_n}$. When $\mathbb{P}_Q$ is
uniform on $\mathcal{Q}_n$, all shuffled graphs within the same isomorphism set are equally likely; that is,
$\{\theta'_{G_i|y} = \theta'_{G_j|y} \ \forall G_i, G_j : Q(G_i) = G_j \text{ for some } Q \in \mathcal{Q}_n\}$.

Note that one can think of a labeled graph as a shuffled graph for which $\mathbb{P}_Q$ is a point mass at
$Q = I$, where $I$ is the identity matrix.
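As an illustration of the shuffle channel, the sketch below applies a vertex permutation to an adjacency matrix stored as a list of lists. The helper names `shuffle_graph` and `uniform_shuffle` and the list-of-lists representation are assumptions made for this example, not notation from the text.

```python
import random


def shuffle_graph(adj, perm):
    """Relabel vertices: vertex i of the input becomes vertex perm[i] of the output."""
    n = len(adj)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[perm[i]][perm[j]] = adj[i][j]
    return out


def uniform_shuffle(adj, rng):
    """Draw Q uniformly from the n! vertex permutations and apply it."""
    perm = list(range(len(adj)))
    rng.shuffle(perm)
    return shuffle_graph(adj, perm)


# Path graph 0 - 1 - 2 as a symmetric, hollow, binary adjacency matrix.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]

# The identity permutation is the "labeled graph" special case (point mass at Q = I).
print(shuffle_graph(path, [0, 1, 2]) == path)
print(uniform_shuffle(path, random.Random(0)))
```

Whatever permutation is drawn, the output is isomorphic to the input, so quantities such as the sorted degree sequence are unchanged.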

2.3 An unlabeled graph classification model 

The above shuffling view is natural whenever the vertices of the collection of graphs share a set of labels, 
but the labeling function is unknown. However, when the vertices of the collection of graphs have different 
labels, perhaps a different view is more natural. 

An unlabeled graph $\tilde{G}$ is the collection of graphs isomorphic to one another, that is,
$\tilde{G} = \{Q(G)\}_{Q \in \mathcal{Q}_n}$. Let $\tilde{G}$ be an element of the collection of graph
isomorphism sets $\tilde{\mathcal{G}}_n$. The number of unlabeled graphs on $n$ vertices is
$|\tilde{\mathcal{G}}_n| = \tilde{d}_n \approx d_n / n!$ (see [7] and references therein). An unlabeling
function $U : \mathcal{G}_n \to \tilde{\mathcal{G}}_n$ is a function that takes as input a graph and outputs
the corresponding unlabeled graph.

Definition 3. Let $\tilde{G} = U(G) : \Omega \to \tilde{\mathcal{G}}_n$ be an unlabeled graph-valued random
variable, that is, a labeled graph-valued random variable that has been passed through an unlabeling channel.
In other words, $\tilde{G} = \{Q(G)\}_{Q \in \mathcal{Q}_n}$, and takes values
$\tilde{G} \in \tilde{\mathcal{G}}_n$.

The joint distribution over unlabeled graphs and classes is therefore
$\mathbb{P}_{\tilde{G},Y} = \mathbb{P}_{U(G),Y} = \mathbb{P}_{U(Q(G)),Y}$, which decomposes as
$\mathbb{P}_{\tilde{G}|Y} \pi_Y$. The class-conditional distributions $\mathbb{P}_{\tilde{G}|y}$ over
isomorphism sets (unlabeled graphs) can also be thought of as discrete distributions
$\mathrm{Discrete}(\tilde{G}; \tilde{\theta}_y)$, where $\tilde{\theta}_y \in \Delta_{\tilde{d}_n}$ are
vectors in the $\tilde{d}_n$-dimensional unit simplex. Comparing shuffling and unlabeling for the independent
and uniform shuffle distribution $\mathbb{P}_Q$, we have
$\{\theta'_{G|y} = \tilde{\theta}_{\tilde{G}|y} / |\tilde{G}| \text{ for all } G \in \tilde{G}\}$.
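For very small $n$, an unlabeling function can be written by brute force. The sketch below picks, as a representative of each isomorphism set, the lexicographically smallest sorted edge tuple over all $n!$ relabelings. This canonical-form construction and the names `canonical_form` and `count_graphs` are our own illustrative choices, and the approach is only feasible for a handful of vertices.

```python
from itertools import combinations, permutations


def canonical_form(edges, n):
    """Brute-force unlabeling: smallest sorted edge tuple over all n! relabelings."""
    best = None
    for perm in permutations(range(n)):
        relabeled = tuple(sorted(tuple(sorted((perm[u], perm[v]))) for u, v in edges))
        if best is None or relabeled < best:
            best = relabeled
    return best


def count_graphs(n):
    """Return (number of labeled simple graphs, number of unlabeled graphs) on n vertices."""
    pairs = list(combinations(range(n), 2))
    labeled = 0
    classes = set()
    for mask in range(2 ** len(pairs)):
        edges = [p for k, p in enumerate(pairs) if (mask >> k) & 1]
        labeled += 1
        classes.add(canonical_form(edges, n))
    return labeled, len(classes)


for n in (3, 4):
    print(n, count_graphs(n))
```

Two labeled graphs receive the same canonical form exactly when they are isomorphic, so the size of `classes` is the number of unlabeled graphs. For such small $n$ the ratio $d_n / n!$ is only a rough guide to that number.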
3 Bayes Optimal Graph Classifiers

We consider graph classification in the three scenarios described above: labeled, shuffled, and unlabeled.
To proceed, in each scenario we define four mathematical objects: (i) a graph classifier, (ii) the risk,
(iii) the Bayes optimal classifier, and (iv) the Bayes risk.

3.1 Bayes Optimal Labeled Graph Classifiers

A labeled graph classifier $h : \mathcal{G}_n \to \mathcal{Y}$ is any function that maps from labeled graph
space to class space. The risk of a labeled graph classifier $h$ under 0-1 loss is the expected
misclassification rate $L(h) = \mathbb{E}[h(G) \neq Y]$, where the expectation is taken against
$\mathbb{P}_{G,Y}$. The labeled graph Bayes optimal classifier is given by

$$h_* = \operatorname*{argmin}_{h \in \mathcal{H}} L(h), \qquad (2)$$

where $\mathcal{H}$ is the set of possible labeled graph classifiers. The labeled graph Bayes risk is given by
$L_* = L(h_*)$, where $L_*$ implicitly depends on $\mathbb{P}_{G,Y}$.

3.2 Bayes Optimal Shuffled Graph Classifiers 

A shuffled graph classifier is also any function $h : \mathcal{G}_n \to \mathcal{Y}$ (note that the set of
shuffled graphs is the same as the set of labeled graphs). However, by virtue of the input being a shuffled
graph as opposed to a labeled graph, the shuffled risk under 0-1 loss is given by
$L'(h) = \mathbb{E}[h(Q(G)) \neq Y]$, where the expectation is taken against $\mathbb{P}_{Q(G),Y}$. The
shuffled graph Bayes optimal classifier is given by

$$h'_* = \operatorname*{argmin}_{h \in \mathcal{H}} L'(h), \qquad (3)$$

where $\mathcal{H}$ is again the set of possible labeled (or shuffled) graph classifiers. The shuffled graph
Bayes risk is given by $L'_* = L'(h'_*)$, where $L'_*$ implicitly depends on $\mathbb{P}_{Q(G),Y}$.

3.3 Bayes Optimal Unlabeled Graph Classifiers 

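An unlabeled graph classifier $\tilde{h} : \tilde{\mathcal{G}}_n \to \mathcal{Y}$ is any function that maps
from unlabeled graph space to class space. The risk under 0-1 loss is given by
$\tilde{L}(\tilde{h}) = \mathbb{E}[\tilde{h}(\tilde{G}) \neq Y]$, where the expectation is taken against
$\mathbb{P}_{\tilde{G},Y}$. The unlabeled graph Bayes optimal classifier is given by

$$\tilde{h}_* = \operatorname*{argmin}_{\tilde{h} \in \tilde{\mathcal{H}}} \tilde{L}(\tilde{h}). \qquad (4)$$

The unlabeled graph Bayes risk is given by $\tilde{L}_* = \tilde{L}(\tilde{h}_*)$, where
$\tilde{\mathcal{H}}$ is the set of possible unlabeled graph classifiers and $\tilde{L}_*$ implicitly depends
on $\mathbb{P}_{\tilde{G},Y}$.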

3.4 Parametric Graph Classifiers 

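The three Bayes optimal graph classifiers can be written explicitly in terms of their model parameters:

$$h_*(G) = \operatorname*{argmax}_{y \in \mathcal{Y}} \theta_{G|y} \pi_y, \qquad (5)$$

$$h'_*(G) = \operatorname*{argmax}_{y \in \mathcal{Y}} \theta'_{G|y} \pi_y, \qquad (6)$$

$$\tilde{h}_*(\tilde{G}) = \operatorname*{argmax}_{y \in \mathcal{Y}} \tilde{\theta}_{\tilde{G}|y} \pi_y. \qquad (7)$$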
4 Under what conditions does shuffling the vertices degrade classification performance?

The result of either shuffling or unlabeling a graph can only degrade, but not improve, Bayes risk. This is a
restatement of the data processing lemma for this scenario. Specifically, [1] shows that the data processing
lemma indicates that in the classification domain $L^*_X \le L^*_{T(X)}$ for any transformation $T$ and data
$X$. In our setting, this becomes:

Lemma 1. $L_* \le L'_* = \tilde{L}_*$.

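Proof. Assume for simplicity $|\mathcal{Y}| = 2$ and $\pi_0 = \pi_1 = 1/2$. Because
$\tilde{\theta}_{\tilde{G}|y} = \sum_{G \in \tilde{G}} \theta_{G|y}$ and a minimum of sums is at least the
sum of the minima,

$$\tilde{L}_* = \frac{1}{2} \sum_{\tilde{G} \in \tilde{\mathcal{G}}_n} \min_{y} \sum_{G \in \tilde{G}} \theta_{G|y} \ge \frac{1}{2} \sum_{\tilde{G} \in \tilde{\mathcal{G}}_n} \sum_{G \in \tilde{G}} \min_{y} \theta_{G|y} = L_*. \qquad (8)$$

□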



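An immediate consequence of the above proof is that the inequality in the statement of Lemma 1 is strict
whenever the inequality in Eq. (8) is strict:

Lemma 2. $L_* < L'_* = \tilde{L}_*$ if and only if there exists $\tilde{G}$ such that

$$\min_{y} \sum_{G \in \tilde{G}} \theta_{G|y} > \sum_{G \in \tilde{G}} \min_{y} \theta_{G|y}.$$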



The above result demonstrates that even when the labels do carry some class-conditional signal, it may
be the case that shuffling or unlabeling does not degrade performance. In other words, the following two
statements are equivalent: (i) the labels contain information with regard to the classification task, and (ii)
some graphs within an isomorphism set are class-conditionally more likely than others:
$\theta_{G_i|y} \neq \theta_{G_j|y}$ where $Q(G_i) = G_j$ for some $G_i, G_j \in \mathcal{G}_n$,
$Q \in \mathcal{Q}_n$, and $y \in \mathcal{Y}$. Uniform shuffling has the effect of "flattening" likelihoods
within isomorphism sets, from $\theta_y$ to $\theta'_y$, so that $\theta'_y$ satisfies
$\{\theta'_{G|y} = \tilde{\theta}_{\tilde{G}|y} / |\tilde{G}| \ \forall G \in \tilde{G}\}$. But just because
the shuffling changes class-conditional likelihoods does not mean that Bayes risk must also change. This
result follows immediately upon realizing that posteriors can change without classification performance
changing. The above results are easily extended to consider non-equal class priors and c-class
classification problems. To see this, ignoring ties, simply replace each minimum likelihood with a sum over
all non-maximum posteriors:

$$\min_{y} \theta_{G|y} \pi_y \to \sum_{y \in \mathcal{Y}'} \theta_{G|y} \pi_y, \quad \text{where } \mathcal{Y}' = \{y : y \neq \operatorname*{argmax}_{y' \in \mathcal{Y}} \theta_{G|y'} \pi_{y'}\}. \qquad (9)$$
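The distinction between Lemma 1 and Lemma 2 can be checked numerically on a toy model. The sketch below uses made-up likelihood tables over three labeled graphs, two of which are assumed to share an isomorphism set, and computes the Bayes risk before and after pooling. The function names and all numbers are hypothetical and chosen only to illustrate the two cases.

```python
from fractions import Fraction as F


def bayes_risk(theta, priors):
    """Bayes risk: for each outcome, sum the joint masses of all non-maximal classes."""
    risk = F(0)
    for per_class in theta:  # per_class[y] = likelihood of this outcome under class y
        joint = [t * p for t, p in zip(per_class, priors)]
        risk += sum(joint) - max(joint)
    return risk


def unlabel(theta, iso_sets):
    """Pool likelihoods over each isomorphism set (given as lists of graph indices)."""
    n_classes = len(theta[0])
    return [[sum(theta[g][y] for g in members) for y in range(n_classes)]
            for members in iso_sets]


priors = [F(1, 2), F(1, 2)]
iso_sets = [[0, 1], [2]]  # graphs 0 and 1 are assumed isomorphic; graph 2 is alone

# Labels carry signal (rows 0 and 1 differ), yet the Bayes risk is unchanged.
theta_same = [[F(5, 10), F(2, 10)], [F(3, 10), F(1, 10)], [F(2, 10), F(7, 10)]]
# Here the two isomorphic graphs favor different classes, so pooling hurts.
theta_worse = [[F(6, 10), F(1, 10)], [F(2, 10), F(6, 10)], [F(2, 10), F(3, 10)]]

for theta in (theta_same, theta_worse):
    print(bayes_risk(theta, priors), bayes_risk(unlabel(theta, iso_sets), priors))
```

In the first table both isomorphic graphs favor the same class, so the minimum of the pooled masses equals the sum of the individual minima. In the second table they disagree, which is the strict-inequality condition of Lemma 2.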

Prior to concluding this section, we remark that one can achieve Bayes optimal risk using graph invariants.
A graph invariant on $\mathcal{G}_n$ is any function $\psi$ such that $\psi(G) = \psi(Q(G))$ for all
$G \in \mathcal{G}_n$ and $Q \in \mathcal{Q}_n$ (note that an unlabeling function $U(G)$ is a special case of
$\psi$). A graph invariant classifier is a composition of a classifier with an invariant function,
$h^\psi = f^\psi \circ \psi$. The Bayes optimal graph invariant classifier minimizes risk over all invariants:

$$h^\psi_* = \operatorname*{argmin}_{\psi \in \Psi,\ f^\psi \in \mathcal{F}^\psi} \mathbb{E}[f^\psi(\psi(G)) \neq Y], \qquad (10)$$

where $\Psi$ is the space of all possible invariants and $\mathcal{F}^\psi$ is the space of classifiers
composable with invariant $\psi$. The expectation in Eq. (10) is taken against $\mathbb{P}_{G,Y}$ or
equivalently $\mathbb{P}_{Q(G),Y}$, since invariants are by definition unchanged by shuffling. Let
$L^\psi_*$ denote the Bayes invariant risk.

Lemma 3. $L^\psi_* = \tilde{L}_*$.

Proof. Let $\psi$ indicate in which equivalence set $G$ resides; that is, $\psi(G) = \tilde{G}$ if and only if
$G \in \tilde{G}$. Then

$$h^\psi_*(G) = \operatorname*{argmax}_{y \in \mathcal{Y}} \tilde{\theta}_{\psi(G)|y} \pi_y = \operatorname*{argmax}_{y \in \mathcal{Y}} \tilde{\theta}_{\tilde{G}|y} \pi_y = \tilde{h}_*(\tilde{G}). \qquad (11)$$

□
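To make the definition of a graph invariant concrete, the following sketch computes three simple invariants (size, max degree, and number of triangles) and checks that they do not change under a vertex relabeling. The choice of these three and the helper names are ours. Unlike the $\psi$ used in the proof of Lemma 3, such summary invariants are generally many-to-one across isomorphism sets, so they need not achieve the unlabeled Bayes risk.

```python
from itertools import combinations


def invariants(adj):
    """Return (size, max degree, number of triangles) of a simple graph."""
    n = len(adj)
    degrees = [sum(row) for row in adj]
    size = sum(degrees) // 2  # each edge is counted twice in the degree sum
    triangles = sum(1 for i, j, k in combinations(range(n), 3)
                    if adj[i][j] and adj[j][k] and adj[i][k])
    return (size, max(degrees), triangles)


def relabel(adj, perm):
    """Apply a vertex permutation: vertex i becomes vertex perm[i]."""
    n = len(adj)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[perm[i]][perm[j]] = adj[i][j]
    return out


# Triangle on vertices 0, 1, 2 with a pendant vertex 3 attached to vertex 0.
g = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]

print(invariants(g), invariants(relabel(g, [3, 2, 1, 0])))
```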

5 Do universally consistent graph classifiers exist? 

Throughout this paper, we consider two distinct "flavors" of graph classifiers that we can estimate from the
data: (i) Bayes plug-in and (ii) nearest neighbor. Below, we first introduce Bayes plug-in graph classifiers.
In the following sections, we will discuss their asymptotic properties, as well as the asymptotic properties
of $k_s$ nearest neighbor graph classifiers.

5.1 Bayes Plug-In Graph Classifiers 

Implementing the above optimal classifiers requires knowing the model parameters. When the parameters
are unknown (effectively always), we assume that the data are sampled identically and independently from
some unknown joint distribution: $(Q_i(G_i), Y_i) \overset{iid}{\sim} \mathbb{P}_{Q,G,Y}$. For labeled graph
classification, $\mathbb{P}_Q$ is assumed to be a point mass at the identity function; therefore,
$\mathcal{T}_s = \{(G_i, Y_i)\}_{i \in [s]}$, because when graphs are labeled $Q_i(G_i) = G_i$. For shuffled
graph classification $\mathbb{P}_Q$ is assumed to be uniform over the permutation matrices, so that all label
information is both unavailable and irrecoverable. The training data are therefore
$\mathcal{T}'_s = \{(G'_i, Y_i)\}_{i \in [s]}$, where $G'_i = Q_i(G_i)$. For unlabeled graph classification
the training data are again $\mathcal{T}'_s$. Our task is to utilize training data to induce a classifier that
approximates a Bayes classifier as closely as possible.

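A labeled graph Bayes plug-in classifier,
$h_s : \mathcal{G}_n \times (\mathcal{G}_n \times \mathcal{Y})^s \to \mathcal{Y}$, estimates the parameters
$\{\theta_y, \pi_y\}_{y \in \mathcal{Y}}$ using the training data
$\mathcal{T}_s = \{(G_i, Y_i)\}_{i \in [s]}$, then plugs those estimates into the labeled Bayes classifier,
Eq. (5), resulting in

$$h_s(G) = \operatorname*{argmax}_{y \in \mathcal{Y}} \hat{\theta}_{G|y} \hat{\pi}_y, \qquad (12)$$

where the dependency on the training data is implicit in the $h_s(G)$ notation.

A shuffled graph Bayes plug-in classifier,
$h'_s : \mathcal{G}_n \times (\mathcal{G}_n \times \mathcal{Y})^s \to \mathcal{Y}$, estimates the parameters
$\{\theta'_y, \pi_y\}_{y \in \mathcal{Y}}$ using the training data
$\mathcal{T}'_s = \{(G'_i, Y_i)\}_{i \in [s]}$, then plugs those estimates into the shuffled Bayes classifier,
Eq. (6), resulting in

$$h'_s(G) = \operatorname*{argmax}_{y \in \mathcal{Y}} \hat{\theta}'_{G|y} \hat{\pi}_y. \qquad (13)$$

An unlabeled graph Bayes plug-in classifier,
$\tilde{h}_s : \mathcal{G}_n \times (\mathcal{G}_n \times \mathcal{Y})^s \to \mathcal{Y}$, first determines
in which unlabeled set each shuffled graph resides, using $\psi$ as defined in the proof of Lemma 3. Then, it
estimates the parameters $\{\tilde{\theta}_{\psi(G')|y}\}_{y \in \mathcal{Y}}$ and
$\{\pi_y\}_{y \in \mathcal{Y}}$ using the training data $\mathcal{T}'_s$. Finally, it plugs those estimates
into the unlabeled Bayes classifier, Eq. (7), resulting in

$$\tilde{h}_s(G) = \operatorname*{argmax}_{y \in \mathcal{Y}} \hat{\tilde{\theta}}_{\psi(G)|y} \hat{\pi}_y. \qquad (14)$$

For brevity, we will sometimes refer to the above three induced classifiers as simply "classifiers".
Moreover, we will also refer to the sequence of classifiers (for example, $\{h_s\}_{s \to \infty}$) as a
"classifier".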
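A Bayes plug-in classifier of the kind described above can be sketched in a few lines by treating each graph as a hashable key and using empirical frequencies as the maximum likelihood estimates. The names `fit_plugin` and `classify`, the string keys, and the tie-breaking rule (smallest label wins) are assumptions of this sketch, not part of the text.

```python
from collections import Counter


def fit_plugin(training):
    """training: list of (graph_key, label) pairs. Returns (theta_hat, pi_hat) as dicts."""
    class_counts = Counter(label for _, label in training)
    joint_counts = Counter(training)
    s = len(training)
    pi_hat = {y: c / s for y, c in class_counts.items()}
    theta_hat = {(g, y): c / class_counts[y] for (g, y), c in joint_counts.items()}
    return theta_hat, pi_hat


def classify(graph_key, theta_hat, pi_hat):
    """Plug-in rule: argmax over classes of theta_hat[G|y] * pi_hat[y]."""
    # Graphs never seen with class y get an estimated likelihood of zero.
    return max(sorted(pi_hat),
               key=lambda y: theta_hat.get((graph_key, y), 0.0) * pi_hat[y])


# Toy training set; in practice a key could be a sorted tuple of edges.
training = [("g1", 0), ("g1", 0), ("g2", 0), ("g1", 1), ("g2", 1), ("g2", 1)]
theta_hat, pi_hat = fit_plugin(training)
print(classify("g1", theta_hat, pi_hat), classify("g2", theta_hat, pi_hat))
```

The same code serves the labeled and shuffled cases, depending on whether the keys come from labeled or shuffled graphs. The unlabeled case would first map each graph to a representative of its isomorphism set.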
5.2 Bayes Plug-in Graph Classifiers are Universally Consistent

The three parametric classifiers, Eqs. (12)-(14), admit classifier estimators that exist, are unique, and
moreover, are universally consistent, although the relative convergence rates and values that they converge
to differ.

Let $L_s = L(h_s)$ be the risk of the induced labeled graph Bayes plug-in classifier using the training data
$\mathcal{T}_s$ to obtain maximum likelihood estimators for $\{\theta_y, \pi_y\}_{y \in \mathcal{Y}}$. Note
that $L_s$ is a random variable, as it is a function of the random training data $\mathcal{T}_s$. This yields

Lemma 4. $L_s \to L_*$ as $s \to \infty$.

Proof. Because $\mathcal{G}_n$ and $\mathcal{Y}$ are both finite, the maximum likelihood estimates for the
categorical parameters $\{\theta_y, \pi_y\}_{y \in \mathcal{Y}}$ are guaranteed to exist and be unique [1].
Hence, the labeled graph Bayes plug-in classifier is universally consistent to $L_*$ (that is, it converges to
$L_*$ regardless of the true joint distribution, $\mathbb{P}_{Q,G,Y}$) [1]. □

Similarly, let $L'_s = L'(h'_s)$ be the risk of the induced shuffled graph Bayes plug-in classifier using the
training data $\mathcal{T}'_s$ to obtain maximum likelihood estimators for
$\{\theta'_y, \pi_y\}_{y \in \mathcal{Y}}$. This yields

Corollary 1. $L'_s \to L'_*$ as $s \to \infty$.

Proof. The previous proof rests on the finitude of $\mathcal{G}_n$, which remains finite after shuffling
(uniform or otherwise), and therefore, the previous proof holds, replacing $L_*$ with $L'_*$. □

Thus while one could merely plug the shuffled graphs into $\hat{\theta}_y$, such a procedure is inadvisable.
Specifically, the above procedure does not use the fact that $\theta'_{G_i|y} = \theta'_{G_j|y}$ whenever
$Q(G_i) = G_j$ for some $Q \in \mathcal{Q}_n$. Instead, consider the risk
$\tilde{L}_s = \tilde{L}(\tilde{h}_s)$ of the induced unlabeled graph Bayes plug-in classifier upon using the
$\psi$ function to map each shuffled graph to its corresponding unlabeled graph, and then obtaining maximum
likelihood estimates of the unlabeled graph parameters, $\tilde{\theta}$.

Corollary 2. $\tilde{L}_s \to \tilde{L}_*$ as $s \to \infty$.

Because $|\tilde{\mathcal{G}}_n| \ll |\mathcal{G}_n|$ (by a factor of approximately $n!$), it follows that
classifying by first projecting the graphs into a lower dimensional space should yield improved performance.
Specifically, we have the following result:

Lemma 5. $\tilde{h}_s$ dominates $h'_s$ for shuffled graph data.

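Proof. Consider the scalar $\tilde{\theta}_{\tilde{G}|y}$ decomposed into the vector
$(\theta'_{G_1|y}, \ldots, \theta'_{G_{|\tilde{G}|}|y})$, where each
$\theta'_{G_i|y} = \tilde{\theta}_{\tilde{G}|y} / |\tilde{G}|$. Note that each
$\theta'_{G_i|y} = \theta'_{G_j|y}$. Yet the estimators $\hat{\tilde{\theta}}_{\tilde{G}|y}$ and
$\hat{\theta}'_{G_i|y}$ are not equivalent, because the former can borrow strength from all shuffled graphs
within the same unlabeled graph, but the latter does not. Assuming without loss of generality that the class
priors are equal and known, the above domination claim is equivalent to stating that for each $G$,

$$\mathbb{P}\big[\operatorname*{argmax}_{y \in \mathcal{Y}} \hat{\tilde{\theta}}_{\tilde{G}|y} \neq \operatorname*{argmax}_{y \in \mathcal{Y}} \tilde{\theta}_{\tilde{G}|y} \mid \mathcal{T}'_s\big] \le \mathbb{P}\big[\operatorname*{argmax}_{y \in \mathcal{Y}} \hat{\theta}'_{G|y} \neq \operatorname*{argmax}_{y \in \mathcal{Y}} \theta'_{G|y} \mid \mathcal{T}'_s\big]. \qquad (15)$$

Because $\theta'_{G|y} = \tilde{\theta}_{\tilde{G}|y} / |\tilde{G}|$, the only difference between the two
sides of the above inequality is the estimators. We know that the estimators have the following distributions:

$$s_{\tilde{G}} \hat{\tilde{\theta}}_{\tilde{G}|y} \sim \mathrm{Binomial}(\tilde{\theta}_{\tilde{G}|y}, s_{\tilde{G}}), \qquad (16a)$$

$$s_G \hat{\theta}'_{G|y} \sim \mathrm{Binomial}(\theta'_{G|y}, s_G), \qquad (16b)$$

where $s_{\tilde{G}}$ is the number of observations of any $G \in \tilde{G}$ in the training data, and $s_G$
is the number of observations of $G$ in the training data. From this, we see that for each $G$,
$\hat{\tilde{\theta}}_{\tilde{G}|y}$ will have a tighter concentration around the truth due to its borrowing
strength, because $s_{\tilde{G}} \ge s_G$, so our result holds. □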
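The "borrowing strength" argument can be illustrated by counting. In the sketch below, three differently shuffled copies of the same path graph each appear once in a toy training set. Counted as labeled graphs, each has a single observation, but after mapping to a canonical representative all three pool into one count. The brute-force `canonical_form` helper and the toy data are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def canonical_form(edges, n):
    """Brute-force unlabeling: smallest sorted edge tuple over all n! relabelings."""
    best = None
    for perm in permutations(range(n)):
        relabeled = tuple(sorted(tuple(sorted((perm[u], perm[v]))) for u, v in edges))
        if best is None or relabeled < best:
            best = relabeled
    return best


def raw_counts(training):
    """Counts per (shuffled labeled graph, class): the s_G of the shuffled estimator."""
    return Counter((tuple(sorted(tuple(sorted(e)) for e in edges)), y)
                   for edges, y in training)


def pooled_counts(training, n):
    """Counts per (unlabeled graph, class): the pooled count of the unlabeled estimator."""
    return Counter((canonical_form(edges, n), y) for edges, y in training)


# Three shuffled copies of the 3-vertex path, all from class 0.
training = [([(0, 1), (1, 2)], 0), ([(0, 2), (1, 2)], 0), ([(0, 1), (0, 2)], 0)]

print(max(raw_counts(training).values()), max(pooled_counts(training, 3).values()))
```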

5.3 $k_s$ Nearest Neighbor Graph Classifiers are Universally Consistent

Corollary 2 demonstrates that one can induce a universally consistent classifier $\tilde{h}_s$ using Eq. (14).
Lemma 5 further shows that the performance of $\tilde{h}_s$ dominates $h'_s$. Yet, using $\tilde{h}_s$ is
practically useless for two reasons. First, it requires solving $s$ graph isomorphism problems.
Unfortunately, no algorithm for solving graph isomorphism problems is known to run in polynomial time in the
worst case [8]. Second, the number of parameters to estimate is super-exponential in $n$
($\tilde{d}_n \approx 2^{\binom{n}{2}} / n!$), and acceptable performance will typically require
$s \gg \tilde{d}_n$. We can therefore not even store the parameter estimates for small graphs (e.g.,
$n = 30$), much less estimate them. This motivates consideration of an alternative strategy.
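The storage claim for $n = 30$ can be checked with a short calculation on a log scale. The sketch below evaluates $\log_{10}$ of the approximation $2^{\binom{n}{2}} / n!$. The comparison value of roughly $10^{80}$ atoms in the observable universe is a commonly quoted order of magnitude that we assume here, not a figure from the text.

```python
from math import comb, lgamma, log


def log10_unlabeled_count(n: int) -> float:
    """log10 of the approximation 2^(n choose 2) / n! to the number of unlabeled graphs."""
    # lgamma(n + 1) is the natural log of n!.
    return (comb(n, 2) * log(2) - lgamma(n + 1)) / log(10)


ASSUMED_LOG10_ATOMS = 80  # rough order of magnitude, assumed for comparison only

for n in (10, 20, 30):
    print(n, round(log10_unlabeled_count(n), 1))
```

For $n = 30$ the exponent comes out near 98.5, well above the assumed value of 80, which matches the qualitative point made above.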

A $k_s$ nearest-neighbor ($k_s$NN) classifier using Euclidean norm distance is universally consistent to
$L_*$ for vector-valued data as long as $k_s \to \infty$ with $k_s / s \to 0$ as $s \to \infty$ [9]. This
non-parametric approach circumvents the need to estimate many parameters in high-dimensional settings such as
graph classification. The universal consistency proof for $k_s$NN was extended to graph-valued data in
reference [10], which we include here for completeness. Specifically, to compare labeled graphs, reference
[10] considered a Frobenius norm distance

$$\delta(G_i, G_j) = \|A_i - A_j\|_F^2, \qquad (17)$$

where $A_i$ is the adjacency matrix representation of the labeled graph $G_i$. Let $h^\delta_s$ denote the
Frobenius norm $k_s$NN classifier on labeled graphs using $\delta$, and let $L^\delta_s$ indicate the
misclassification rate for this classifier. Reference [10] showed:

Lemma 6. $L^\delta_s \to L_*$ as $s \to \infty$.

Proof. Because both $\mathcal{G}_n$ and $\mathcal{Y}$ have finite cardinality, the law of large numbers
ensures that eventually as $s \to \infty$, the plurality of nearest neighbors to a test graph will be
identical to the test graph. □
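A minimal sketch of a nearest-neighbor classifier built on the squared Frobenius distance of Eq. (17) is given below. Adjacency matrices are plain lists of lists, and `frobenius_sq` and `knn_classify` are hypothetical helper names. Ties are broken by training order, which is one arbitrary choice among several.

```python
from collections import Counter


def frobenius_sq(a, b):
    """Squared Frobenius norm of the difference of two equally sized matrices."""
    return sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def knn_classify(test_adj, training, k, dist=frobenius_sq):
    """Plurality vote among the k training graphs closest to test_adj."""
    neighbors = sorted(training, key=lambda pair: dist(test_adj, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


empty = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

training = [(empty, "sparse"), (triangle, "dense")]
print(frobenius_sq(path, triangle), knn_classify(path, training, k=1))
```

For simple graphs the squared Frobenius distance is twice the number of vertex pairs on which the two graphs disagree, since each undirected edge occupies two matrix entries.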

Let $h^{\delta'}_s$ denote the Frobenius norm $k_s$NN classifier on shuffled graphs, using $\delta'$ (the
Frobenius norm distance applied to shuffled adjacency matrices), and let $L^{\delta'}_s$ indicate the
misclassification rate for this classifier. From the above lemma and Corollary 1 the below follows
immediately:

Corollary 3. $L^{\delta'}_s \to L'_*$ as $s \to \infty$.

Given shuffled graph data $\mathcal{T}'_s$, however, other distance metrics appear more "natural" to us. For
example, consider the "graph-matched Frobenius norm" distance:

$$\tilde{\delta}(G'_i, G'_j) = \min_{Q \in \mathcal{Q}_n} \|Q(A'_i) - A'_j\|_F^2, \qquad (18)$$

where $A'_i$ and $A'_j$ are shuffled adjacency matrices. Let $h^{\tilde{\delta}}_s$ denote the $k_s$NN
classifier using the above graph-matched norm $\tilde{\delta}$ on shuffled graphs, and let
$L^{\tilde{\delta}}_s$ indicate the misclassification rate for this classifier. Given an exact graph matching
function (a function that actually solves Eq. (18)), we have the following result:

Corollary 4. $L^{\tilde{\delta}}_s \to L'_*$ as $s \to \infty$.
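For graphs with only a few vertices, the graph-matched distance of Eq. (18) can be evaluated exactly by enumerating all $n!$ permutations. The sketch below does this by brute force. It is meant only to make the definition concrete, since the factorial cost is what makes the exact approach impractical. The helper names are ours.

```python
from itertools import permutations


def frobenius_sq(a, b):
    """Squared Frobenius norm of the difference of two equally sized matrices."""
    return sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def permute(adj, perm):
    """Reorder rows and columns of adj by the same vertex permutation."""
    n = len(adj)
    return [[adj[perm[i]][perm[j]] for j in range(n)] for i in range(n)]


def graph_matched_frobenius_sq(a, b):
    """Exact graph-matched distance: minimize over all n! vertex permutations of a."""
    n = len(a)
    return min(frobenius_sq(permute(a, p), b) for p in permutations(range(n)))


path_center_1 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
path_center_0 = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # the same path, shuffled
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

print(frobenius_sq(path_center_1, path_center_0),
      graph_matched_frobenius_sq(path_center_1, path_center_0),
      graph_matched_frobenius_sq(path_center_1, triangle))
```

The plain Frobenius distance treats the two copies of the path as different graphs, whereas the matched distance is zero exactly when the two graphs are isomorphic.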
Thus, given shuffled data $\mathcal{T}'_s$, one could consider either $h^{\delta'}_s$ or
$h^{\tilde{\delta}}_s$.

Interestingly, when the data are labeled graphs, $\mathcal{T}_s$, one can outperform $h^\delta_s$ by
shuffling, that is, by apparently destroying the label information. Consider an example in which
$\theta = \theta'$, such that no information is in the labels. In such scenarios, shuffling can effectively
borrow strength from different labeled graphs that are within the same unlabeled graph set. Let
$h^{\tilde{\delta}}_s$ denote the $k_s$NN classifier using $\tilde{\delta}$ on labeled graphs, and let
$L^{\tilde{\delta}}_s$ indicate the misclassification rate for this classifier. We therefore state
without proof:

Lemma 7. Neither $h^\delta_s$ nor $h^{\tilde{\delta}}_s$ dominates when data are labeled graphs.

Thus, when the training data consists of shuffled graphs, the best universally consistent classifier (of
those considered herein) is a $k_s$NN that uses $\tilde{\delta}$ as the distance metric. Other universally
consistent classifiers that we considered either require estimating more parameters than there are molecules
in the universe, or are inadmissible under 0-1 loss. When vertex labels are available, no classifier
dominates.

5.4 Comparing Asymptotic Performances 

The above theoretical results consider Bayes plug-in and $k_s$NN classifiers. Here we consider other
classifiers. Specifically, let $L^\psi_s$ be the misclassification rate for some classifier that operates on
$\mathcal{T}'_s$, that is, only has access to shuffled graphs. Consider the set of seven graph invariants
studied in [11]: size, max degree, max eigenvalue, scan statistic, number of triangles, and average path
length. Via Monte Carlo, [11] was unable to find a uniformly most powerful graph invariant (test statistic
[12]) for a particular hypothesis testing scenario with unlabeled graphs. The above results, however,
indicate that there exist optimal classifiers (or test statistics) for any unlabeled or shuffled graph
setting. To proceed, let $h^\pi_s$ be the chance classifier, that is

$$h^\pi_s(G) = \operatorname*{argmax}_{y \in \mathcal{Y}} \hat{\pi}_y, \qquad (19)$$

and let $L^\pi_s$ be the misclassification rate for this classifier. Moreover, let $L^\psi_s$ be the risk of
the invariant classifier that is equivalent to the unlabeled Bayes plug-in classifier (see Lemma 3). From the
above results, it follows that:

Lemma 8. In expectation,

$$L^\pi_s \ge L'_* = \tilde{L}_s = L^\psi_s = L^{\delta'}_s = L^{\tilde{\delta}}_s \quad \text{as } s \to \infty.$$

5.5 Comparing Computational Properties 

While asymptotic results can be informative and insightful, understanding the computational properties of
the different classifiers can be as (or even more) informative for real applications. Table 1 compares the
space and time complexity of the various classifiers considered above. Only the $k_s$NN classifiers have the
property that they do not require more space than there are atoms in the universe (for any $n$ bigger than
$\approx 30$). Of those, the labeled $k_s$NN classifier does not require time exponential in the number of
vertices. Therefore, we only found one type of classifier with performance guarantees that has both
polynomial space and time. Unfortunately, the finite sample performance of this classifier is abysmal. This
motivates constructing approximate classifiers.

6 Real World Application 

We buttress the above theoretical results via numerical experiments. The asymptotic results combined with
the computational complexities of the above described algorithms suggest that none of the proposed
algorithms have all the properties we effectively require for real world applications, in particular,
polynomial space and time complexity, as well as reasonable convergence rates. We therefore propose a
different algorithm, which lacks universal consistency, but can be run on real data with good hope for
reasonable performance. In particular, we modify $h^{\tilde{\delta}}_s$, the unshuffled $k_s$NN classifier.
Instead of requiring this classifier to actually solve the graph matching problem, Eq. (18), we use a
recently proposed state-of-the-art approximate cubic time algorithm [16]. Denote this classifier
$h^{\hat{\delta}}_s$.

Table 1: Order of computational properties for training the various shuffled graph classifiers.

name                       notation                  time         space
chance
invariant
labeled Bayes plug-in                                $s$          $\min(2^{\binom{n}{2}}, s)$
unlabeled Bayes plug-in    $\tilde{h}_s$             $e^n s$      $\min(2^{\binom{n}{2}}/n!, s)$
labeled $k_s$NN            $h^{\delta'}_s$                        $n^2 s$
unshuffled $k_s$NN         $h^{\tilde{\delta}}_s$    $e^n s^2$    $n^2 s$



6.1 Shuffled Connectome Classification 

A "connectome" is a brain-graph in which vertices correspond to (groups of) neurons, and edges correspond 
to connections between them. Diffusion Magnetic Resonance (MR) Imaging and related technologies are 
making the acquisition of MR connectomes routine [13]. The data comprise 49 subjects from the Baltimore
Longitudinal Study on Aging, with acquisition and connectome inference details as reported in [14]. Each
connectome yields a 70 vertex simple graph (binary, symmetric, and hollow adjacency matrix). Associated
with each graph is a class label based on the sex of the individual (24 males, 25 females). Because the vertices
are labeled, we can compare the results of having the labels and not having the labels. 
Consider the following five classifiers: 

• $\delta$-1NN: A 1-nearest neighbor (1NN) classifier with Frobenius norm distance on the labeled adjacency
matrices.

• $\delta'$-1NN: A 1NN with Frobenius norm distance on the shuffled adjacency matrices.

• $\hat{\delta}$-1NN: A 1NN with an approximate graph-matched Frobenius norm distance on the shuffled
adjacency matrices, as described above. Because graph-matching is NP-hard [15], we instead use an
inexact graph matching approach based on the quadratic assignment formulation described in [16],
which only requires $O(n^3)$ time.

• $\psi$-1NN: A 1NN with Euclidean distance using the seven graph invariants described above. Prior to
computing the Euclidean distance, for each invariant, we rescale all the values to lie between zero and
one.

• $\pi$: Use the chance classifier defined above.

Performance is assessed by leave-one-out misclassification rate. 
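The evaluation protocol can be sketched as follows: rescale each feature to the unit interval, then for each held-out sample predict the label of its nearest remaining neighbor and count the errors. The sketch uses squared Euclidean distance on feature vectors, as would apply to the invariant-based classifier. The toy feature values and the function names are hypothetical, and a graph distance could be passed as `dist` instead.

```python
def rescale_columns(rows):
    """Min-max rescale every column to [0, 1]; constant columns map to 0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(row, lo, hi)]
            for row in rows]


def euclidean_sq(u, v):
    """Squared Euclidean distance (same nearest neighbor as the unsquared distance)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def loo_1nn_error(samples, labels, dist=euclidean_sq):
    """Leave-one-out misclassification rate of a 1-nearest-neighbor classifier."""
    errors = 0
    for i in range(len(samples)):
        others = (k for k in range(len(samples)) if k != i)
        nearest = min(others, key=lambda k: dist(samples[i], samples[k]))
        errors += labels[nearest] != labels[i]
    return errors / len(samples)


# Hypothetical invariant vectors (e.g., size and number of triangles) for four graphs.
features = rescale_columns([[0, 10], [1, 12], [9, 100], [10, 110]])
labels = [0, 0, 1, 1]
print(loo_1nn_error(features, labels))
```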

Figure 1 reifies the above theoretical results in a particular finite sample regime. We apply the five
algorithms discussed above to sub-samples of the connectome data, which shows approximate convergence
rates for this data. Fortunately, this real data example supports the main lemmas of this work. Specifically,
the $k_s$NN classifier using $\delta$ on the labeled graphs (dashed gray line) achieves the lowest
misclassification rate for all $s$, which one would expect if labels contain appropriate class signal.
Moreover, the $k_s$NN classifier using the inexact graph-matching Frobenius norm on the shuffled adjacency
matrices, $\hat{\delta}$, performs best of all classifiers using only shuffled graphs (compare dashed black
line with solid black and gray lines). On the other hand, while the $k_s$NN classifier using the Frobenius
norm on shuffled graphs, $\delta'$, must eventually converge to $L'_*$, its convergence rate is quite slow,
so the classifier using standard invariants $\psi$ outperforms the simple $\delta'$ based $k_s$NN.

[Figure 1 plot: Connectome Classifier Comparison]

Figure 1: Connectome misclassification rates for various classifiers. 2000 Monte Carlo sub-samples of the
data were performed for each $s$, such that error bars were negligibly small. Five classifiers were compared, as
described in main text. Note that when s is larger than 20, as predicted by theory, we have > It > 
Ll> Ll. Moreover, Lf' > Lf > Lf . 




7 Discussion 

In this work, we address both the theoretical and practical limitations of classifying shuffled graphs, relative 
to labeled and unlabeled graphs. Specifically, first we construct the notion of shuffled graphs and shuffled 
graph classifiers in a parallel fashion with labeled and unlabeled graphs/classifiers, as we were unable to 
find such notions in the literature. Then, we show that shuffling the vertex labels results in an irretrievable
situation, with a possible degradation of classification performance (Lemma 1). Even if the vertex labels
contained class-conditional signal, Bayes performance may remain unchanged (Lemma 2). Moreover,
although one cannot recover the vertex labels, one can obtain a Bayes optimal classifier by solving a large
number of graph isomorphism problems (Lemma 3). This resolves a theoretical conundrum: is there a set
of graph invariants that can yield a universally consistent graph classifier? When the generative distribution
is unavailable, one can induce a consistent and efficient "unshuffling" classifier by using a graph-matching
strategy (Corollary 2). While this unshuffling approach dominates the more naive approach (Lemma 5), it
is intractable in practice due to the difficulty of graph matching and the large number of isomorphism sets.
Instead, a Frobenius norm $k_s$NN classifier applied to the adjacency matrices may be used, which is also
universally consistent (Corollary 4). Surprisingly, none of the considered classifiers dominates the others
for labeled data (Lemma 7), yet asymptotically, we can order shuffled graph classifiers (Lemma 8).

Because graph-matching is NP-hard, we instead use an approximate graph-matching algorithm in practice
(see [16] for details). Applying these $k_s$NN classifiers to a problem of considerable scientific interest,
namely classifying human MR connectomes, we find that even with a relatively small sample size ($s > 20$),
the approximately graph-matched $k_s$NN algorithm performs nearly as well as the $k_s$NN algorithm using
vertex labels, and slightly better than a $k_s$NN algorithm applied to a set of graph invariants proposed
previously [11]. This suggests that the asymptotics might apply even for very small sample sizes. Thus, this
theoretical insight has led us to improved practical classification performance. Extensions to weighted or
(certain) attributed graphs are straightforward.